knitr::opts_chunk$set(cache=TRUE, echo=F, warning=F, error = F, message=F)
knitr::opts_knit$set(root.dir = "/users/scottsfarley/documents")
setwd("/users/scottsfarley/documents")
library(parallel)
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
library(akima)
library(ggplot2)
options(java.parameters = "-Xmx1500m")
library(bartMachine)
## Loading required package: rJava
## Loading required package: bartMachineJARs
## Loading required package: car
## Warning: replacing previous import 'lme4::sigma' by 'stats::sigma' when
## loading 'pbkrtest'
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## Loading required package: missForest
## Loading required package: itertools
## Welcome to bartMachine v1.2.3! You have 1.4GB memory available.
##
## If you run out of memory, restart R, and use e.g.
## 'options(java.parameters = "-Xmx5g")' for 5GB of RAM before you call
## 'library(bartMachine)'.
bartMachine::set_bart_machine_num_cores(3)
## bartMachine now using 3 cores.
library(reshape2)
library(ggdendro)
threshold.time <- 20 ##seconds
threshold.cost <- Inf ##cents
threshold.numTex <- 45
First, get the training data and fit the model. Perform some skill checks on it.
## bartMachine initializing with 50 trees...
## bartMachine vars checked...
## bartMachine java init...
## bartMachine factors created...
## bartMachine before preprocess...
## bartMachine after preprocess... 6 total features...
## bartMachine sigsq estimated...
## bartMachine training data finalized...
## Now building bartMachine for regression ...
## serializing in order to be saved for future R sessions...done
## bartMachine initializing with 50 trees...
## bartMachine vars checked...
## bartMachine java init...
## bartMachine factors created...
## bartMachine before preprocess...
## bartMachine after preprocess... 6 total features...
## bartMachine sigsq estimated...
## bartMachine training data finalized...
## Now building bartMachine for regression ...
## serializing in order to be saved for future R sessions...done
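The skill checks can be sketched as follows. The observed and predicted values below are illustrative stand-ins; in the analysis they come from the fitted bartMachine models' out-of-sample predictions.

```r
## Sketch of a skill check: RMSE and pseudo-R^2 on held-out predictions.
## These vectors are stand-ins for the model's out-of-sample results.
observed  <- c(4.8, 5.1, 5.6, 6.0)
predicted <- c(5.0, 5.2, 5.5, 5.9)

rmse <- sqrt(mean((observed - predicted)^2))
r2   <- 1 - sum((observed - predicted)^2) /
            sum((observed - mean(observed))^2)
```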
Choose a finite number of possible solutions to the model. Ideally, we would evaluate every combination of predictor variables on [0, Inf), but this is obviously intractable, and I only have data for a subset of that space anyway. So randomly sample the subspace in which I have data to make the problem solvable.
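A minimal sketch of this sampling step, using illustrative ranges rather than the exact experimental design:

```r
## Build a finite candidate set by sampling the region where we have data.
set.seed(1)
candidates <- expand.grid(
  cores            = 1:12,
  GBMemory         = c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22),
  trainingExamples = seq(1000, 10000, by = 1000),
  covariates       = 1:6
)
## thin the grid at random to keep prediction tractable
candidates <- candidates[sample(nrow(candidates), 5000), ]
```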
Using that subset of data and the models we fit previously, predict each candidate configuration of algorithm inputs and hardware variables for execution time and SDM accuracy.
Plot the posterior means of the accuracy models against the algorithm inputs that should control accuracy. In this case, these are number of training examples and number of covariates.
The accuracy clearly varies from low (few training examples and few covariates) to very high (many training examples and many covariates). More data might sharpen this surface, but it suffices here. Our task is to find the combination of inputs that results in the highest-accuracy model; if there’s a tie, find the combination that needs the least data.
Now we know the combination of algorithm inputs that results in the highest accuracy. The figure below marks this combination on the training-examples and covariates axes. An algorithm with these inputs can be run on any combination of hardware, though some combinations will be suboptimal in time or cost. Thus, at this point, we’ve solved half of our challenge: the algorithm inputs have been optimized, and now it’s time to optimize the hardware.
## [1] "Accuracy is maximized at 9000 training examples and 5 predictors."
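The selection logic, including the least-data tie-break, can be sketched like this. The accuracy surface below is a capped stand-in, not the model's actual posterior means.

```r
grid <- expand.grid(trainingExamples = seq(1000, 10000, by = 1000),
                    covariates       = 1:6)
## stand-in accuracy surface, capped so several input combinations tie
grid$accuracy <- with(grid,
  pmin(0.85, round(0.5 + 0.2 * trainingExamples / 10000 + 0.03 * covariates, 2)))

best <- grid[grid$accuracy == max(grid$accuracy), ]
## tie-break: prefer the combination that needs the least data
best <- best[order(best$trainingExamples, best$covariates), ][1, ]
```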
In theory, the hardware parameters should not affect SDM accuracy. We can test this assumption by plotting the accuracies obtained for this combination of algorithm inputs against the number of CPUs and the amount of memory. If the assumption is valid, the plot should show no change in either the horizontal or vertical direction. We see that there is, in fact, some change, likely due to the experimental design and the lack of a full factorial setup. The effect is relatively minor, so I note it and move on.
## [1] "Accuracy Range on Hardware: 0.0229455489422251"
## [1] "Accuracy Range from Expectation: 0"
## [1] "------"
## [1] "Fixing accuracy at: 0.713110150877667"
Now, fix the algorithm inputs at the accuracy-maximizing point, effectively fixing the expected model accuracy. An algorithm with these inputs can be run on any combination of hardware. Project how long that specific model would take, and how much it would cost, on every computing type, then plot the results on time vs. cost axes.
The optimal solution is the one that balances time and cost equally during the minimization. We use a standardized Euclidean distance, dividing each dimension by its standard deviation so the two are weighted equally. For each candidate combination of hardware, we calculate the distance between it and the origin of these two axes, then take the point with the minimum distance as the optimal.
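A sketch of the distance minimization, with illustrative time and cost values standing in for the models' posterior means:

```r
## Posterior-mean time (seconds) and cost (cents) for four stand-in
## hardware candidates.
seconds <- c(5.30, 5.52, 5.53, 5.82)
cents   <- c(0.32, 0.17, 0.33, 0.87)

## Standardize each axis by its standard deviation so time and cost are
## weighted equally, then take the distance to the origin.
d <- sqrt((seconds / sd(seconds))^2 + (cents / sd(cents))^2)
optimal <- which.min(d)
```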
Our job is complete: we’ve now optimized both the hardware and software dimensions of the problem.
## [1] "------GAM OPTIMAL--------"
## [1] "Predicted Optimal Accuracy 0.713110150877667 +/- 0"
## [1] "Predicted Optimal Cost (seconds) 5.30210844237311"
## [1] "Predicted Optimal Cost (cents) 0.318762759555471"
## [1] "Cores: 2"
## [1] "Memory: 2"
## [1] "Training Examples: 9000"
## [1] "Covariates: 5"
Everything up to this point was done using the mean of the posterior distribution, a choice that simplifies the process but discards information and may lead to over-confident predictions. We can modify our steps to include information from the entire posterior, which may address this issue.
Instead of projecting just the mean time and mean cost for use in the distance minimization, use the entire set of posterior samples. Calculate the distance metric for each sample in the posterior independently. We are then left with a density distribution of distances, from which we can infer the minimum value.
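This per-draw calculation can be sketched as follows. The posterior draws below are simulated; in the analysis they come from the fitted bartMachine model, and cost is derived from time by a fixed rate per configuration (the rates here are illustrative).

```r
## Simulated posterior draws of execution time (rows = draws, columns =
## hardware configs); stand-ins for the model's actual posterior samples.
set.seed(42)
n.draws   <- 1000
post.time <- matrix(rnorm(n.draws * 3, mean = c(5.3, 5.5, 5.8), sd = 0.4),
                    nrow = n.draws, byrow = TRUE)

## cost is a fixed linear function of time (illustrative cents/second)
rate      <- c(0.06, 0.09, 0.15)
post.cost <- sweep(post.time, 2, rate, `*`)

## distance for every draw, standardizing each axis by its sd; this gives
## a distribution of distances per config rather than a single number
d <- sqrt((post.time / sd(post.time))^2 + (post.cost / sd(post.cost))^2)
dist.mean <- colMeans(d)
dist.sd   <- apply(d, 2, sd)
```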
The posteriors fall along a line, since there’s a fixed linear relationship between time and cost.
Now, find the distance metrics for all of those points.
There’s a lot of overlap in this figure, and many points are far from the optimal; we don’t care about those. Take the few closest to the minimum and look at their distributions.
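One way to sketch this filtering and grouping step is hierarchical clustering on the mean distances; the values below are illustrative.

```r
## Illustrative posterior-mean distances for six candidate configs.
dist.mean <- c(5.33, 5.34, 5.36, 5.53, 5.56, 5.90)

## Group the candidates, then keep the cluster containing the minimum.
hc      <- hclust(dist(dist.mean))
cluster <- cutree(hc, k = 3)
top     <- which(cluster == cluster[which.min(dist.mean)])
```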
Now, the optimal configuration may be one of the following:
| config | cores | GBMemory | seconds | cost | distance.mean | distance.sd |
|---|---|---|---|---|---|---|
| 13 | 2 | 2 | 5.302108 | 0.3187628 | 5.330181 | 0.4410313 |
| 25 | 3 | 2 | 5.302108 | 0.4781441 | 5.342165 | 0.4420229 |
| 37 | 4 | 2 | 5.302108 | 0.6375255 | 5.358898 | 0.4434074 |
| 1 | 1 | 2 | 5.524324 | 0.1660612 | 5.529001 | 0.1526480 |
| 3 | 1 | 5 | 5.530293 | 0.3324812 | 5.555245 | 0.4078013 |
| 14 | 2 | 4 | 5.513988 | 0.5525016 | 5.557669 | 0.4251251 |
| 4 | 1 | 6 | 5.530293 | 0.3878948 | 5.558856 | 0.4080664 |
| 5 | 1 | 8 | 5.530293 | 0.4987218 | 5.567735 | 0.4087182 |
| 6 | 1 | 10 | 5.530293 | 0.6095489 | 5.578814 | 0.4095315 |
| 7 | 1 | 12 | 5.530293 | 0.7203760 | 5.592080 | 0.4105053 |
| 26 | 3 | 4 | 5.513988 | 0.8287524 | 5.592090 | 0.4277581 |
| 8 | 1 | 14 | 5.530293 | 0.8312030 | 5.607517 | 0.4116385 |
| 9 | 1 | 16 | 5.530293 | 0.9420301 | 5.625107 | 0.4129298 |
| 38 | 4 | 4 | 5.513988 | 1.1050032 | 5.639927 | 0.4314173 |
| 10 | 1 | 18 | 5.530293 | 1.0528572 | 5.644830 | 0.4143777 |
| 11 | 1 | 20 | 5.530293 | 1.1636843 | 5.666665 | 0.4159805 |
| 12 | 1 | 22 | 5.530293 | 1.2745113 | 5.690586 | 0.4177365 |
| 49 | 5 | 2 | 5.818556 | 0.8745290 | 5.900384 | 0.4391454 |
| 61 | 6 | 2 | 5.818556 | 1.0494349 | 5.928991 | 0.4412745 |
| 73 | 7 | 2 | 5.818556 | 1.2243407 | 5.962622 | 0.4437775 |
| 85 | 8 | 2 | 5.818556 | 1.3992465 | 6.001192 | 0.4466482 |
| 97 | 9 | 2 | 5.818556 | 1.5741523 | 6.044608 | 0.4498795 |
| 109 | 10 | 2 | 5.818556 | 1.7490581 | 6.092766 | 0.4534637 |
| 121 | 11 | 2 | 5.818556 | 1.9239639 | 6.145554 | 0.4573926 |
| 133 | 12 | 2 | 5.818556 | 2.0988697 | 6.202854 | 0.4616572 |
| config | cores | GBMemory | seconds | cost | distance.mean | distance.sd | cluster |
|---|---|---|---|---|---|---|---|
| 13 | 2 | 2 | 5.302108 | 0.3187628 | 5.330181 | 0.4410313 | 1 |
| 25 | 3 | 2 | 5.302108 | 0.4781441 | 5.342165 | 0.4420229 | 1 |
| 37 | 4 | 2 | 5.302108 | 0.6375255 | 5.358898 | 0.4434074 | 1 |
In the results above, you’re actually seeing the trade-off between time and money play out quite nicely. Adding cores costs money but, in the case of GAMs, reduces time. Here, that trade-off almost exactly evens out.
In this scenario, we’ve got a constraint on the amount of data available to us (threshold.numTex).
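The constrained re-optimization can be sketched as follows, with a stand-in accuracy surface; only candidates within the data cap are considered.

```r
threshold.numTex <- 45  ## maximum number of training examples available

grid <- expand.grid(trainingExamples = 1:100, covariates = 1:6)
## stand-in accuracy surface, increasing in both inputs
grid$accuracy <- with(grid,
  0.4 + 0.1 * log10(trainingExamples) + 0.03 * covariates)

## drop every candidate that needs more data than we have
feasible <- grid[grid$trainingExamples <= threshold.numTex, ]
best     <- feasible[which.max(feasible$accuracy), ]
```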
## [1] "Current data threshold is 45"
## [1] "Accuracy is maximized at 44 training examples and 5 predictors."
## [1] "Expected Max Accuracy is 0.677619581671622"
## [1] "Now there are only: 287 candidates, instead of 287 candidates that can complete this scenario under budget."
## [1] "Recommended # cores: 2"
## [1] "Recommended Memory: 2"
## [1] "Expected Cost: 0.314118049911889"
## [1] "Expected Seconds: 5.22485112960561"
| config | cores | GBMemory | seconds | cost | distance.mean | distance.sd |
|---|---|---|---|---|---|---|
| 13 | 2 | 2 | 5.302108 | 0.3187628 | 5.330181 | 0.4410313 |
| 25 | 3 | 2 | 5.302108 | 0.4781441 | 5.342165 | 0.4420229 |
| 37 | 4 | 2 | 5.302108 | 0.6375255 | 5.358898 | 0.4434074 |
| 3 | 1 | 5 | 5.530293 | 0.3324812 | 5.555245 | 0.4078013 |
| 14 | 2 | 4 | 5.513988 | 0.5525016 | 5.557669 | 0.4251251 |
| 4 | 1 | 6 | 5.530293 | 0.3878948 | 5.558856 | 0.4080664 |
| 5 | 1 | 8 | 5.530293 | 0.4987218 | 5.567735 | 0.4087182 |
| 6 | 1 | 10 | 5.530293 | 0.6095489 | 5.578814 | 0.4095315 |
| 7 | 1 | 12 | 5.530293 | 0.7203760 | 5.592080 | 0.4105053 |
| 26 | 3 | 4 | 5.513988 | 0.8287524 | 5.592090 | 0.4277581 |
| 8 | 1 | 14 | 5.530293 | 0.8312030 | 5.607517 | 0.4116385 |
| 9 | 1 | 16 | 5.530293 | 0.9420301 | 5.625107 | 0.4129298 |
| 10 | 1 | 18 | 5.530293 | 1.0528572 | 5.644830 | 0.4143777 |
| 11 | 1 | 20 | 5.530293 | 1.1636843 | 5.666665 | 0.4159805 |
| 12 | 1 | 22 | 5.530293 | 1.2745113 | 5.690586 | 0.4177365 |
| config | cores | GBMemory | seconds | cost | distance.mean | distance.sd | cluster |
|---|---|---|---|---|---|---|---|
| 13 | 2 | 2 | 5.302108 | 0.3187628 | 5.330181 | 0.4410313 | 1 |
| 25 | 3 | 2 | 5.302108 | 0.4781441 | 5.342165 | 0.4420229 | 1 |
| 37 | 4 | 2 | 5.302108 | 0.6375255 | 5.358898 | 0.4434074 | 1 |
In this scenario, the constraint is on execution time (threshold.time).

| config | cores | GBMemory | seconds | cost | distance.mean | distance.sd |
|---|---|---|---|---|---|---|
| 13 | 2 | 2 | 5.302108 | 0.3187628 | 1.671116 | 0.0840417 |
| 25 | 3 | 2 | 5.302108 | 0.4781441 | 1.674874 | 0.0842307 |
| 37 | 4 | 2 | 5.302108 | 0.6375255 | 1.680120 | 0.0844945 |
| 3 | 1 | 5 | 5.530293 | 0.3324812 | 1.713329 | 0.0737160 |
| 4 | 1 | 6 | 5.530293 | 0.3878948 | 1.714443 | 0.0737639 |
| 14 | 2 | 4 | 5.513988 | 0.5525016 | 1.715837 | 0.0764087 |
| 5 | 1 | 8 | 5.530293 | 0.4987218 | 1.717181 | 0.0738817 |
| 6 | 1 | 10 | 5.530293 | 0.6095489 | 1.720598 | 0.0740287 |
| 7 | 1 | 12 | 5.530293 | 0.7203760 | 1.724689 | 0.0742048 |
| 26 | 3 | 4 | 5.513988 | 0.8287524 | 1.726464 | 0.0768819 |
| 8 | 1 | 14 | 5.530293 | 0.8312030 | 1.729450 | 0.0744096 |
| 9 | 1 | 16 | 5.530293 | 0.9420301 | 1.734875 | 0.0746430 |
| 10 | 1 | 18 | 5.530293 | 1.0528572 | 1.740958 | 0.0749048 |
| 11 | 1 | 20 | 5.530293 | 1.1636843 | 1.747693 | 0.0751945 |
| 12 | 1 | 22 | 5.530293 | 1.2745113 | 1.755070 | 0.0755119 |
| config | cores | GBMemory | seconds | cost | distance.mean | distance.sd | cluster |
|---|---|---|---|---|---|---|---|
| 13 | 2 | 2 | 5.302108 | 0.3187628 | 1.671116 | 0.0840417 | 1 |
| 25 | 3 | 2 | 5.302108 | 0.4781441 | 1.674874 | 0.0842307 | 1 |
| 37 | 4 | 2 | 5.302108 | 0.6375255 | 1.680120 | 0.0844945 | 1 |